AITopics

2607.00252

Country:

Europe (0.92)
North America > United States > California (0.27)

Genre:

Research Report (0.64)
Workflow (0.45)

Industry: Banking & Finance > Insurance (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.60)

arXiv.org Machine Learning, Jul-1-2026

Random Reshuffling Dominates Stochastic Gradient Descent

Liu, Zijian

Stochastic Gradient Descent ($\textsf{SGD}$) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of $\textsf{SGD}$ differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ($\textsf{Shuffling SGD}$). A particularly popular strategy in $\textsf{Shuffling SGD}$ is Random Reshuffling ($\textsf{RR}$), which has achieved great empirical success across numerous experiments. Despite its strong performance, $\textsf{RR}$ has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for $\textsf{RR}$, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of $\textsf{RR}$ remain to this day. More precisely, according to the current theory, $\textsf{Shuffling SGD}$ under $\textsf{RR}$ converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of $\textsf{Shuffling SGD}$ under $\textsf{RR}$ is strictly worse than that of $\textsf{SGD}$ when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that $\textsf{RR}$ dominates $\textsf{SGD}$ in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.
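The difference between the two sampling schemes described above can be sketched in a few lines. This is an illustration only: the one-dimensional toy objective, the stepsize, the epoch count, and the function names are assumptions, not taken from the paper, and a single run on a toy problem says nothing about the paper's rates.

```python
import random

def sgd_with_replacement(grads, x0, stepsize, epochs, rng):
    """Plain SGD: every step samples one component uniformly, with replacement."""
    n = len(grads)
    x = x0
    for _ in range(epochs * n):
        i = rng.randrange(n)
        x -= stepsize * grads[i](x)
    return x

def random_reshuffling(grads, x0, stepsize, epochs, rng):
    """Shuffling SGD under RR: a fresh permutation per epoch, each component used once."""
    n = len(grads)
    x = x0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            x -= stepsize * grads[i](x)
    return x

# Hypothetical toy objective: f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2,
# whose minimizer is the mean of the a_i.
data = [1.0, 2.0, 3.0, 4.0, 10.0]
grads = [lambda x, a=a: x - a for a in data]
x_star = sum(data) / len(data)

rng = random.Random(0)
x_sgd = sgd_with_replacement(grads, 0.0, 0.05, 20, rng)
x_rr = random_reshuffling(grads, 0.0, 0.05, 20, rng)
print(abs(x_sgd - x_star), abs(x_rr - x_star))
```

Both routines perform the same number of gradient steps per epoch; the only difference is whether indices are drawn with replacement or as a permutation.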

artificial intelligence, machine learning, shuffling sgd, (16 more...)

2606.32005

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Martinez-Sermeno, Flor, Jaramillo, Arturo, Van Horebeek, Johan

Adjusted Wasserstein distances for bridging empirical and true distributions with applications to MDS

arXiv.org Machine Learning, Jun-30-2026

This paper examines how metric adjustments to Multidimensional Scaling (MDS) can enhance its effectiveness as a visual tool for pattern recognition. The distance under consideration, referred to as Max-D-SW, is an adjustment of the Max-Sliced Wasserstein distance. In contrast to the original formulation, which optimizes over single unit directions, Max-D-SW aggregates contributions over orthonormal bases. This modification provides a clear numerical advantage in MDS outcomes, particularly when applied to heavy-tailed distributions. We also establish sample-complexity bounds showing that Max-D-SW remains statistically tractable, with rates comparable to those of its max-sliced counterpart. Moreover, we show that a better sample complexity for a metric does not necessarily translate into better performance when the metric is used as an input for MDS.

artificial intelligence, machine learning, pattern recognition, (16 more...)

2606.29665

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.34)

Koukorinis, Andreas, Silva, Ricardo

Doubly Robust Adaptive Conformal Inference for Causal Effects Under Temporal Dependence

arXiv.org Machine Learning, Jun-30-2026

We propose doubly robust adaptive conformal inference (DR-ACI), which constructs prediction intervals for doubly robust pseudo-outcomes under temporal dependence. Calibration targets the pseudo-outcome $\psi^{\mathrm{DR}}_t$; under estimator consistency, this yields asymptotically conservative CATE containment (Corollary 6). Temporal block cross-fitting preserves switch-coefficient mixing bounds and the DML product-bias rate up to an explicit coupling remainder.
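For orientation, the standard adaptive conformal inference recursion, alpha_{t+1} = alpha_t + gamma * (alpha - err_t), can be sketched as below. This is not the DR-ACI procedure itself: the `targets` list merely stands in for a sequence of pseudo-outcomes, and the absolute-residual score, the constant zero predictions, the warm-up length, and all names are assumptions made for illustration.

```python
import math
import random

def interval_radius(scores, alpha_t):
    """Empirical (1 - alpha_t) quantile of past scores, used as the interval radius."""
    if alpha_t <= 0.0:
        return math.inf       # cover everything
    if alpha_t >= 1.0:
        return -math.inf      # empty interval, always a miss
    ordered = sorted(scores)
    k = math.ceil((1.0 - alpha_t) * len(ordered)) - 1
    return ordered[min(max(k, 0), len(ordered) - 1)]

def adaptive_conformal(targets, predictions, alpha=0.1, gamma=0.01, warmup=50):
    """Return the realized miss rate of intervals [p - r_t, p + r_t] after warm-up."""
    scores = [abs(y - p) for y, p in zip(targets[:warmup], predictions[:warmup])]
    alpha_t, misses = alpha, []
    for y, p in zip(targets[warmup:], predictions[warmup:]):
        score = abs(y - p)
        err = 1 if score > interval_radius(scores, alpha_t) else 0
        misses.append(err)
        alpha_t += gamma * (alpha - err)   # widen after a miss, tighten after a hit
        scores.append(score)
    return sum(misses) / len(misses)

# Hypothetical series whose noise scale triples halfway through.
rng = random.Random(0)
targets = [rng.gauss(0.0, 1.0 if t < 1000 else 3.0) for t in range(2000)]
predictions = [0.0] * 2000
miss_rate = adaptive_conformal(targets, predictions)
print(round(miss_rate, 3))
```

Because the working level alpha_t is pushed back toward the target after every miss or hit, the realized miss rate stays close to alpha even though the score distribution shifts midway.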

adaptive conformal inference for causal effect, artificial intelligence, machine learning, (14 more...)

2606.305

Genre: Research Report (0.83)

Industry: Banking & Finance > Trading (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.46)

Liu, Shixiang, Yang, Hanming

Adversarial Contamination Meets Hard Thresholding: An Iterative Algorithm with Signal Adaptivity and Minimax Optimality

arXiv.org Machine Learning, Jun-29-2026

Pervasive data contamination -- stemming from measurement errors, outliers, or adversarial corruption -- has motivated the development of robust statistical methods. In this context, we propose a two-stage Adversarial Contamination-resistant Iterative Hard Thresholding (AC-IHT) algorithm for high-dimensional regression with contamination. Our nonconvex algorithm achieves minimax near-optimal (up to logarithmic terms) estimation by iteratively updating the coefficient vector and the contamination vector with different thresholding scales. We further demonstrate that our AC-IHT estimator is signal-adaptive: under proper signal conditions, it adaptively attains a sharper estimation rate and more accurate support recovery. Moreover, it enjoys the strong oracle property, laying a theoretical foundation for asymptotic inference. Numerical experiments confirm its superior finite-sample performance. Finally, we discuss theoretical extensions of the proposed procedure to generalized linear models and to heavy-tailed noise settings.
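The idea of updating a coefficient vector and a contamination vector with separate thresholding levels can be illustrated with a generic alternating hard-thresholding loop for the model y = X beta + gamma + noise. This is a sketch under stated assumptions, not the paper's two-stage AC-IHT: the sparsity levels `s` and `k`, the stepsize, the iteration count, the toy data, and all function names are hypothetical.

```python
import random

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def alternating_iht(X, y, s, k, stepsize=0.5, iters=200):
    """Alternate a thresholded gradient step on beta with a thresholded refit of gamma."""
    n, p = len(X), len(X[0])
    beta, gamma = [0.0] * p, [0.0] * n
    for _ in range(iters):
        # Residual of the contaminated model at the current iterate.
        resid = [y[i] - gamma[i] - sum(X[i][j] * beta[j] for j in range(p))
                 for i in range(n)]
        # Gradient of (1/2n) * ||resid||^2 with respect to beta.
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        beta = hard_threshold([beta[j] - stepsize * grad[j] for j in range(p)], s)
        # Treat the k largest residuals of the clean model as contamination.
        raw = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        gamma = hard_threshold(raw, k)
    return beta, gamma

# Hypothetical toy data: 2 active coefficients, 5 grossly corrupted responses.
rng = random.Random(0)
n, p = 100, 10
beta_true = [3.0, -2.0] + [0.0] * (p - 2)
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(X[i][j] * beta_true[j] for j in range(p)) + rng.gauss(0.0, 0.1)
     for i in range(n)]
for i in range(5):
    y[i] += 20.0
beta_hat, gamma_hat = alternating_iht(X, y, s=2, k=5)
print([round(b, 2) for b in beta_hat])
```

The coefficient update and the contamination update use different sparsity budgets (`s` versus `k`), which is the only feature of the abstract this sketch tries to mirror.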

artificial intelligence, data mining, machine learning, (19 more...)

2606.27685

Genre: Research Report > New Finding (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.62)

arXiv.org Machine Learning, Jun-25-2026

Information from coincidences

Balsubramani, Akshay

We prove a single algebraic mixed coincidence identity that unifies a broad swath of information-theoretic variational results. For any family of priors $\{π_i\}$ and real exponents $\{ α_i \}$, the log of the mixed count $E_{x\simν}\!\left[\prod_{i=1}^W π_i^{α_i}(x)\right]$ is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. The identity yields a unified derivation of classical cornerstones of information theory: concentration of empirical distributions (Sanov-type decompositions and Gibbs conditioning), hypothesis-testing error exponents (Chernoff information and its multi-way analogue), change-of-measure inequalities (Donsker-Varadhan and PAC-Bayes), and laws governing rare-pattern coincidences (Erdős-Rényi run-length, iterative guesswork, rate-distortion, and birthday thresholds). Each is recovered as a specialization of the same algebraic equality. It strictly generalizes the classical Rényi entropy and divergence variational formulas (one and two priors respectively) to a $W$-prior simplex, and holds for unnormalized and continuum-indexed priors. Among its consequences are an exact multi-prior PAC-Bayes penalty that subtracts an explicit "coincidence bonus" from the usual single-prior posterior penalty, and the asymptotic MAP error exponent for $W$-ary hypothesis testing as an edge-restricted simplex optimum. We demonstrate the calculus at scale on two large alphabets encoding richly modeled sequential languages: on language-model next-token predictives where we recover contrastive decoding, and on human genomic regulatory sequence where it separates correlated from diverse prior families along a sliding-window trace.

artificial intelligence, machine learning, natural language, (19 more...)

2606.25042

Country: Europe (0.45)

Genre: Research Report > New Finding (0.45)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Leisure & Entertainment (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.45)

Neural Information Processing Systems, Jun-23-2026, 04:47:54 GMT

Optimal Spectral Transitions in High-Dimensional Multi-Index Models

We consider the problem of how many samples from a Gaussian multi-index model are required to weakly reconstruct the relevant index subspace. Despite its increasing popularity as a testbed for investigating the computational complexity of neural networks, results beyond the single-index setting remain elusive. In this work, we introduce spectral algorithms based on the linearization of a message passing scheme tailored to this problem. Our main contribution is to show that the proposed methods achieve the optimal reconstruction threshold. Leveraging a high-dimensional characterization of the algorithms, we show that above the critical threshold the leading eigenvector correlates with the relevant index subspace, a phenomenon reminiscent of the Baik-Ben Arous-Péché (BBP) transition in spiked models arising in random matrix theory.

artificial intelligence, eigenvalue, machine learning, (15 more...)

Country:

North America > United States (0.28)
Europe > France (0.28)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.24)

Genre: Research Report > Experimental Study (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)

Neural Information Processing Systems, Jun-23-2026, 04:47:05 GMT

Tight Asymptotics of Extreme Order Statistics

A classic statistical problem is to study the asymptotic behavior of the order statistics of a large number of independent samples taken from a distribution with finite expectation. This behavior has implications for several core problems in machine learning and economics -- including robust learning under adversarial noise, best-arm identification in bandit algorithms, revenue estimation in second-price auctions, and the analysis of tail-sensitive statistics used in out-of-distribution detection. The research question we tackle in this paper is: How large can the expectation of the ℓ-th maximum of the n samples be? For ℓ = 1, i.e., the maximum, this expectation is known to grow as o(n), which can be shown to be tight. We show that there is a sharp contrast when considering any fixed ℓ > 1. Surprisingly, in

artificial intelligence, data mining, machine learning, (19 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.66)

Neural Information Processing Systems, Jun-23-2026, 04:21:17 GMT

Discovering Opinion Intervals from Conflicts in Signed Graphs

Peter Blohm, Florian Chen, Aristides Gionis, Stefan Neumann

Online social media provide a platform for people to discuss current events and exchange opinions with their peers. While interactions are predominantly positive, in recent years, there has been a lot of research to understand the conflicts in social networks and how they are based on different views and opinions. In this paper, we ask whether the conflicts in a network reveal a small and interpretable set of prevalent opinion ranges that explain the users' interactions. More precisely, we consider signed graphs, where the edge signs indicate positive and negative interactions of node pairs, and our goal is to infer opinion intervals that are consistent with the edge signs.

artificial intelligence, machine learning, vertex, (17 more...)

Country:

North America > United States (0.67)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry:

Government (0.93)
Information Technology > Services (0.34)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Neural Information Processing Systems, Jun-23-2026, 03:11:44 GMT

Beyond Least Squares: Uniform Approximation and the Hidden Cost of Misspecification

We study the problem of controlling worst-case errors in misspecified linear regression under the random design setting, where the regression function is estimated via (penalized) least-squares. This setting arises naturally in value function approximation for bandit algorithms and reinforcement learning (RL). Our first main contribution is the observation that the amplification of the misspecification error when using least-squares is governed by the Lebesgue constant, a classical quantity from approximation theory that depends on the choice of the feature subspace and the covariate distribution. We also show that this dependence on the misspecification error is tight for least-squares regression: in general, no method minimizing the empirical squared loss, including regularized least-squares, can improve it substantially. We argue this explains the empirical observation that some feature-maps (e.g., those derived from the Fourier bases) "work better in RL" than others (e.g., polynomials): given some covariate distribution, the Lebesgue constant is known to be highly sensitive to the choice of the feature-map. As a second contribution, we propose a method that augments the original feature set with auxiliary features designed to reduce the error amplification. We then prove that the method successfully competes with an "oracle" that knows the best way of using the auxiliary features to reduce this amplification. For example, when the domain is a real interval and the features are monomials, our method reduces the amplification factor to O(1) as d → ∞, while without our method, least-squares with the monomials (and in fact polynomials) will suffer a worst-case error amplification of order Ω(d). It follows that there are functions and feature maps for which our method is consistent, while least-squares is inconsistent.

artificial intelligence, lebesgue constant, machine learning, (17 more...)

Country:

Europe (0.28)
North America (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Banking & Finance (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)